Abstract

We study the problem of selecting a subset of k random variables from a large set, in order to obtain 
the best linear prediction of another variable of interest. This problem can be viewed in the context of 
both feature selection and sparse approximation. We analyze the performance of widely used greedy 
heuristics, using insights from the maximization of submodular functions and spectral analysis. We 
introduce the submodularity ratio as a key quantity to help understand why greedy algorithms perform 
well even when the variables are highly correlated. Using our techniques, we obtain the strongest known 
approximation guarantees for this problem, both in terms of the submodularity ratio and the smallest 
k-sparse eigenvalue of the covariance matrix.

We further demonstrate the wide applicability of our techniques by analyzing greedy algorithms for 
the dictionary selection problem, and significantly improve the previously known guarantees. Our theo- 
retical analysis is complemented by experiments on real-world and synthetic data sets; the experiments 
show that the submodularity ratio is a stronger predictor of the performance of greedy algorithms than 
other spectral parameters. 

1 Introduction 

We analyze algorithms for the following important Subset Selection problem: select a subset of k variables 
from a given set of n observation variables which, taken together, "best" predict another variable of interest. 
This problem has many applications ranging from feature selection, sparse learning and dictionary selection 
in machine learning, to sparse approximation and compressed sensing in signal processing. From a machine 
learning perspective, the variables could be features or observable attributes of a phenomenon, and we 
wish to predict the phenomenon using only a small subset from the high-dimensional feature space. In 
signal processing, the variables could correspond to a collection of dictionary vectors, and the goal is to
parsimoniously represent another (output) vector. For many practitioners, the prediction model of choice is
linear regression, and the goal is to obtain a linear model using a small subset of variables, to minimize the
mean square prediction error or, equivalently, maximize the squared multiple correlation $R^2$ [6].

Thus, we formulate the Subset Selection problem for regression as follows: Given the (normalized)
covariances between n variables $X_i$ (which can in principle be observed) and a variable Z (which is to be
predicted), select a subset of $k \ll n$ of the variables $X_i$ and a linear prediction function of Z from the
selected $X_i$ that maximizes the $R^2$ fit. (A formal definition is given in Section 2.) The covariances are
usually obtained empirically from detailed past observations of the variable values.
The above formulation is known [2] to be equivalent to the problem of sparse approximation over
dictionary vectors: the input consists of a dictionary of n feature vectors $x_i \in \mathbb{R}^m$, along with a target
vector $z \in \mathbb{R}^m$, and the goal is to select at most k vectors whose linear combination best approximates $z$.
The pairwise covariances of the previous formulation are then exactly the inner products of the dictionary
vectors.^1

Our problem formulation appears somewhat similar to the problem of sparse recovery [17, 18, 19, 1];
however, note that in sparse recovery, it is generally assumed that the prediction vector is truly (almost)
k-sparse, and the aim is to recover the exact coefficients of this truly sparse solution. However, finding a
sparse solution is a well-motivated problem even if the true solution is not sparse. Even then, running subset
selection to find a sparse approximation to the correct solution helps to reduce cost and model complexity.

This problem is NP-hard [11], so no polynomial-time algorithms are known to solve it optimally for
all inputs. Two approaches are frequently used for approximating such problems: greedy algorithms [10, 14, 5, 17]
and convex relaxation schemes [13, 1, 15, 4]. For our formulation, a disadvantage of convex
relaxation techniques is that they do not provide explicit control over the target sparsity level k of the
solution; additional effort is needed to tune the regularization parameter.

A simpler and more intuitive approach, widely used in practice for subset selection problems (for exam- 
ple, it is implemented in all commercial statistics packages), is to use greedy algorithms, which iteratively 
add or remove variables based on simple measures of fit with Z. Two of the most well-known and widely 
used greedy algorithms are the subject of our analysis: Forward Regression [10] and Orthogonal Matching
Pursuit [14]. (These algorithms are defined formally in Section 3.)

So far, the theoretical bounds on such greedy algorithms have been unable to explain why they perform 
well in practice for most subset selection problem instances. Most previous results for greedy subset
selection algorithms [5, 14, 2] have been based on coherence of the input data, i.e., the maximum correlation
$\mu$ between any pair of variables. Small coherence is an extremely strong condition, and the bounds
usually break down when the coherence is $\omega(1/k)$. On the other hand, most bounds for greedy and convex
relaxation algorithms for sparse recovery are based on a weaker sparse-eigenvalue or Restricted Isometry
Property (RIP) condition [18, 17, 9, 20, 1]. However, these results apply to a different objective:
minimizing the difference between the actual and estimated coefficients of a sparse vector. Simply extending these
results to the subset selection problem adds a dependence on the largest k-sparse eigenvalue and only leads
to weak additive bounds. More importantly, all the above results rely on spectral conditions that suffer from
an inability to explain the performance of the algorithms for near-singular matrices.

Eigenvalue-based bounds fail to explain an observation of many experiments (including ours in Section
5): greedy algorithms often perform very well, even for near-singular input matrices. Our results begin to
explain these observations by proving that the performance of many algorithms does not really depend on
how singular the covariance matrix is, but rather on how far the $R^2$ measure deviates from submodularity
on the given input. We formalize this intuition by defining a measure of "approximate submodularity"
which we term submodularity ratio. We prove that whenever the submodularity ratio is bounded away from
0, the $R^2$ objective is "reasonably close" to submodular, and Forward Regression gives a constant-factor
approximation. This significantly generalizes a recent result of Das and Kempe [2], who had identified a
strong condition termed "absence of conditional suppressors" which ensures that the objective is actually
submodular.

An analysis based on the submodularity ratio does relate to traditional spectral bounds, in that the
ratio is always lower-bounded by the smallest k-sparse eigenvalue of C (though it can be significantly larger
when the predictor variable is not badly aligned with the eigenspace of small eigenvalues). In particular, we
also obtain multiplicative approximation guarantees for both Forward Regression and Orthogonal Matching
Pursuit, whenever the smallest k-sparse eigenvalue of C is bounded away from 0, significantly strengthening
past known bounds on their performance.

^1 For this reason, the dimension m of the feature vectors only affects the problem indirectly, via the accuracy of the estimated
covariance matrix.

An added benefit of our framework is that we obtain much tighter theoretical performance guarantees 
for greedy algorithms for dictionary selection [8]. In the dictionary selection problem, we are given s target
vectors, and a candidate set V of feature vectors. The goal is to select a set $D \subseteq V$ of at most d feature
vectors, which will serve as a dictionary in the following sense. For each of several target vectors, the best
$k < d$ vectors from D will be selected and used to achieve a good fit; the goal is to maximize the
average fit for all of these vectors. (A formal definition is given in Section 2.) This problem of finding a
dictionary of basis functions for sparse representation of signals has several applications in machine learning
and signal processing. Krause and Cevher [8] showed that greedy algorithms for dictionary selection can
perform well in many instances, and proved additive approximation bounds for two specific algorithms,
SDS_MA and SDS_OMP (defined in Section 4). Our approximate submodularity framework allows us to
obtain much stronger multiplicative guarantees without much extra effort. 

Our theoretical analysis is complemented by experiments comparing the performance of the greedy 
algorithms and a baseline convex-relaxation algorithm for subset selection on two real-world data sets and a 
synthetic data set. More importantly, we evaluate the submodularity ratio of these data sets and compare it 
with other spectral parameters: while the input covariance matrices are close to singular, the submodularity 
ratio actually turns out to be significantly larger. Thus, our theoretical results can begin to explain why, in 
many instances, greedy algorithms perform well in spite of the fact that the data may have high correlations. 

Our main contributions can be summarized as follows: 

1. We introduce the notion of the submodularity ratio as a much more accurate predictor of the perfor- 
mance of greedy algorithms than previously used parameters. 

2. We obtain the strongest known theoretical performance guarantees for greedy algorithms for subset
selection. In particular, we show (in Section 3) that the Forward Regression and OMP algorithms are
within a $1 - e^{-\gamma}$ factor and $1 - e^{-(\gamma \cdot \lambda_{\min})}$ factor of the optimal solution, respectively (where the $\gamma$
and $\lambda$ terms are appropriate submodularity and sparse-eigenvalue parameters).

3. We obtain the strongest known theoretical guarantees for algorithms for dictionary selection,
improving on the results of [8]. In particular, we show (in Section 4) that the SDS_MA algorithm is within a
factor $\frac{\gamma}{\lambda_{\max}} (1 - \frac{1}{e})$ of optimal.

4. We evaluate our theoretical bounds for subset selection by running greedy and L1-relaxation
algorithms on real-world and synthetic data, and show how the various submodular and spectral
parameters correlate with the performance of the algorithms in practice.

1.1 Related Work 

As mentioned previously, there has been a lot of recent interest in greedy and convex relaxation techniques
for the sparse recovery problems, both in the noiseless and noisy setting. For L1 relaxation techniques, Tropp
[15] showed conditions based on the coherence (i.e., the maximum correlation between any pair of variables)
of the dictionary that guaranteed near-optimal recovery of a sparse signal. In [1, 4], it was shown that if the
target signal is truly sparse, and the dictionary obeys a restricted isometry property (RIP), then L1 relaxation
can almost exactly recover the true sparse signal. Other results [19, 20] also prove conditions under which
L1 relaxation can recover a sparse signal. Though related, the above results are not directly applicable to
our subset selection formulation, since the goal in sparse recovery is to recover the true coefficients of the
sparse signal, as opposed to our problem of minimizing the prediction error of an arbitrary signal subject to
a specified sparsity level.

For greedy sparse recovery, Zhang [17, 18] and Lozano et al. [9] provided conditions based on sparse
eigenvalues under which Forward Regression and Forward-Backward Regression can recover a sparse
signal. As with the L1 results for sparse recovery, the objective function analyzed in these papers is somewhat
different from that in our subset selection formulation; furthermore, these results are intended mainly for the
case when the predictor variable is truly sparse. Simply extending these results to our problem formulation
gives weaker, additive bounds and requires stronger conditions than our results.

The papers by Das and Kempe [2], Gilbert et al. [5] and Tropp et al. [16, 14] analyzed greedy algorithms
using the same subset selection formulation presented in this work. In particular, they obtained a $1 + \Theta(\mu^2 k)$
multiplicative approximation guarantee for the mean square error objective and a $1 - \Theta(\mu k)$ guarantee for
the $R^2$ objective, whenever the coherence $\mu$ of the dictionary is $O(1/k)$. These results are thus weaker than
those presented here, since they do not apply to instances with even moderate correlations of $\omega(1/k)$.

Other analysis of greedy methods includes the work of Natarajan [11], which proved a bicriteria
approximation bound for minimizing the number of vectors needed to achieve a given prediction error.

As mentioned earlier, the paper by Krause and Cevher [8] analyzed greedy algorithms for the dictionary
selection problem, which generalizes subset selection to prediction of multiple variables. They too use a 
notion of approximate submodularity to provide additive approximation guarantees. Since their analysis 
is for a more general problem than subset selection, applying their results directly to the subset selection 
problem predictably gives much weaker bounds than those presented in this paper for subset selection. 
Furthermore, even for the general dictionary selection problem, our techniques can be used to significantly
improve their analysis of greedy algorithms and obtain tighter multiplicative approximation bounds (as
shown in Section 4).

In general, we note that the performance bounds for greedy algorithms derived using the coherence 
parameter are usually the weakest, followed by those using the Restricted Isometry Property, then those 
using sparse eigenvalues, and finally those using the submodularity ratio. (We show an empirical comparison 
of these parameters in Section 5.)

2 Preliminaries 

The goal in subset selection is to estimate a predictor variable Z using linear regression on a small subset
from the set of observation variables $V = \{X_1, \ldots, X_n\}$. We use $\mathrm{Var}(X_i)$, $\mathrm{Cov}(X_i, X_j)$ and $\rho(X_i, X_j)$
to denote the variance, covariance and correlation of random variables, respectively. By appropriate
normalization, we can assume that all the random variables have expectation 0 and variance 1. The matrix of
covariances between the $X_i$ and $X_j$ is denoted by C, with entries $c_{i,j} = \mathrm{Cov}(X_i, X_j)$. Similarly, we use
b to denote the covariances between Z and the $X_i$, with entries $b_i = \mathrm{Cov}(Z, X_i)$. Formally, the Subset
Selection problem can now be stated as follows:

Definition 2.1 (Subset Selection) Given pairwise covariances among all variables, as well as a parameter
k, find a set $S \subseteq V$ of at most k variables $X_i$ and a linear predictor $Z' = \sum_{i \in S} \alpha_i X_i$ of Z, maximizing the
squared multiple correlation [3, 6]

$$R^2_{Z,S} = \frac{\mathrm{Var}(Z) - \mathrm{E}[(Z - Z')^2]}{\mathrm{Var}(Z)}.$$

$R^2$ is a widely used measure for the goodness of a statistical fit; it captures the fraction of the variance
of Z explained by variables in S. Because we assumed Z to be normalized to have variance 1, it simplifies
to $R^2_{Z,S} = 1 - \mathrm{E}[(Z - Z')^2]$.

For a set S, we use $C_S$ to denote the submatrix of C with row and column set S, and $b_S$ to denote the
vector with only entries $b_i$ for $i \in S$. For notational convenience, we frequently do not distinguish between
the index set S and the variables $\{X_i \mid i \in S\}$. Given the subset S of variables used for prediction, the
optimal regression coefficients are well known to be $\alpha_S = (\alpha_i)_{i \in S} = C_S^{-1} \cdot b_S$ (see, e.g., [6]), and hence
$R^2_{Z,S} = b_S^T C_S^{-1} b_S$. Thus, the subset selection problem can be phrased as follows: Given C, b, and k, select
a set S of at most k variables to maximize $R^2_{Z,S} = b_S^T (C_S^{-1}) b_S$.^2
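As a concrete illustration of the closed form $R^2_{Z,S} = b_S^T C_S^{-1} b_S$, the following minimal NumPy sketch evaluates it for a given index set. The function name `r_squared` and the toy numbers are assumptions made for this example only; the sketch assumes normalized variables and a non-singular $C_S$.

```python
import numpy as np

def r_squared(C, b, S):
    """R^2_{Z,S} = b_S^T C_S^{-1} b_S for normalized variables (Var(Z) = 1)."""
    S = list(S)
    if not S:
        return 0.0  # the empty set explains none of the variance
    C_S = C[np.ix_(S, S)]
    b_S = b[S]
    # Solve C_S * alpha = b_S for the regression coefficients
    # instead of forming the inverse explicitly.
    alpha_S = np.linalg.solve(C_S, b_S)
    return float(b_S @ alpha_S)

# Toy instance (made-up numbers): two predictors with correlation 0.5.
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])
b = np.array([0.6, 0.3])

print(r_squared(C, b, [0]))     # 0.36, up to rounding
print(r_squared(C, b, [0, 1]))  # also 0.36: X_2 adds nothing once X_1 is used
```

In this toy instance the covariance of Z with the second variable is entirely explained by the first, so adding it does not increase the fit.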

The dictionary selection problem generalizes the subset selection problem by considering s predictor
variables $Z_1, Z_2, \ldots, Z_s$. The goal is to select a dictionary D of d observation variables, to optimize the
average fit for the $Z_j$ using at most k vectors from D for each. Formally, the Dictionary Selection
problem is defined as follows:

Definition 2.2 (Dictionary Selection) Given all pairwise covariances among the $Z_j$ and $X_i$ variables, as
well as parameters d and k, find a set D of at most d variables from $\{X_1, \ldots, X_n\}$ maximizing

$$F(D) = \sum_{j=1}^{s} \max_{S \subseteq D, |S| = k} R^2_{Z_j, S}.$$

Many of our results are phrased in terms of eigenvalues of the covariance matrix C and its submatrices.
Since covariance matrices are positive semidefinite, their eigenvalues are real and non-negative [6]. For
any positive semidefinite $n \times n$ matrix A, we denote its eigenvalues by $\lambda_{\min}(A) = \lambda_1(A) \le \lambda_2(A) \le
\ldots \le \lambda_n(A) = \lambda_{\max}(A)$. We use $\lambda_{\min}(C, k) = \min_{S: |S| = k} \lambda_{\min}(C_S)$ to refer to the smallest
eigenvalue of any $k \times k$ submatrix of C (i.e., the smallest k-sparse eigenvalue), and similarly $\lambda_{\max}(C, k) =
\max_{S: |S| = k} \lambda_{\max}(C_S)$.^3 We also use $\kappa(C, k)$ to denote the largest condition number (the ratio of the largest
and smallest eigenvalue) of any $k \times k$ submatrix of C. This quantity is strongly related to the Restricted
Isometry Property in [1]. We also use $\mu(C) = \max_{i \ne j} |c_{i,j}|$ to denote the coherence, i.e., the maximum absolute
pairwise correlation between the $X_i$ variables. Recall the L2 vector and matrix norms: $\|x\|_2 = \sqrt{\sum_i |x_i|^2}$
and $\|A\|_2 = \lambda_{\max}(A) = \max_{\|x\|_2 = 1} \|Ax\|_2$. We also use $\|x\|_0 = |\{i \mid x_i \ne 0\}|$ to denote the sparsity of a
vector x.
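The spectral parameters above can be computed by brute force on very small instances. The sketch below enumerates all $k \times k$ principal submatrices, so its running time is exponential in k and it is meant only to make the definitions concrete; the function names and the example matrix are assumptions of this illustration.

```python
import itertools
import numpy as np

def sparse_eigenvalues(C, k):
    """Return (lambda_min(C, k), lambda_max(C, k)) by enumerating all k-subsets."""
    n = C.shape[0]
    lo, hi = np.inf, -np.inf
    for S in itertools.combinations(range(n), k):
        eig = np.linalg.eigvalsh(C[np.ix_(S, S)])  # eigenvalues in ascending order
        lo = min(lo, eig[0])
        hi = max(hi, eig[-1])
    return float(lo), float(hi)

def coherence(C):
    """mu(C): largest absolute off-diagonal entry of C."""
    off_diagonal = C - np.diag(np.diag(C))
    return float(np.max(np.abs(off_diagonal)))

# Made-up 3x3 correlation matrix.
C = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.0]])

lam_min, lam_max = sparse_eigenvalues(C, 2)
print(lam_min, lam_max, coherence(C))  # 0.5, 1.5 and 0.5, up to rounding
```

The extreme values here come from the pair with correlation 0.5, whose $2 \times 2$ submatrix has eigenvalues 0.5 and 1.5.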

The part of a variable Z that is not correlated with the $X_i$ for all $i \in S$, i.e., the part that cannot be
explained by the $X_i$, is called the residual (see [3]), and defined as $\mathrm{Res}(Z, S) = Z - \sum_{i \in S} \alpha_i X_i$.

2.1 Submodularity Ratio 

We introduce the notion of submodularity ratio for a general set function, which captures "how close" to 
submodular the function is. We first define it for arbitrary set functions, and then show the specialization for 
the $R^2$ objective.

Definition 2.3 (Submodularity Ratio) Let f be a non-negative set function. The submodularity ratio of f
with respect to a set U and a parameter $k \ge 1$ is

$$\gamma_{U,k}(f) = \min_{L \subseteq U,\; S: |S| \le k,\; S \cap L = \emptyset} \frac{\sum_{x \in S} \bigl(f(L \cup \{x\}) - f(L)\bigr)}{f(L \cup S) - f(L)}.$$

Thus, it captures how much more f can increase by adding any subset S of size k to L, compared to the
combined benefits of adding its individual elements to L.

If f is specifically the $R^2$ objective defined on the variables $X_i$, then we omit f and simply define

$$\gamma_{U,k} = \min_{L \subseteq U,\; S: |S| \le k,\; S \cap L = \emptyset} \frac{\sum_{i \in S} \bigl(R^2_{Z, L \cup \{X_i\}} - R^2_{Z,L}\bigr)}{R^2_{Z, S \cup L} - R^2_{Z,L}} = \min_{L \subseteq U,\; S: |S| \le k,\; S \cap L = \emptyset} \frac{(b^L_S)^T b^L_S}{(b^L_S)^T (C^L_S)^{-1} b^L_S},$$

where $C^L$ and $b^L$ are the normalized covariance matrix and normalized covariance vector corresponding
to the set $\{\mathrm{Res}(X_1, L), \mathrm{Res}(X_2, L), \ldots, \mathrm{Res}(X_n, L)\}$.

^2 We assume throughout that $C_S$ is non-singular. For some of our results, an extension to singular matrices is possible using the
Moore-Penrose generalized inverse.

^3 Computing $\lambda_{\min}(C, k)$ is NP-hard. In Appendix A we describe how to efficiently approximate the values for some scenarios.

It can be easily shown that f is submodular if and only if $\gamma_{U,k} \ge 1$, for all U and k. For the purpose of
subset selection, it is significant that the submodularity ratio can be bounded in terms of the smallest sparse 
eigenvalue, as shown in the following lemma: 

Lemma 2.4 $\gamma_{U,k} \ge \lambda_{\min}(C, k + |U|) \ge \lambda_{\min}(C)$.

For all our analysis in this paper, we will use $|U| = k$, and hence $\gamma_{U,k} \ge \lambda_{\min}(C, 2k)$. Thus, the smallest
2k-sparse eigenvalue is a lower bound on this submodularity ratio; as we show later, it is often a weak lower
bound.
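To make the definition of the submodularity ratio and the bound of Lemma 2.4 tangible, the following sketch evaluates the ratio for the $R^2$ objective by direct enumeration of all admissible pairs (L, S). It is exponential-time and only intended for toy inputs. The function names, the tolerance `tol` used to skip pairs whose denominator vanishes, and the example data are assumptions of this illustration.

```python
import itertools
import numpy as np

def r_squared(C, b, S):
    """R^2_{Z,S} = b_S^T C_S^{-1} b_S (normalized variables, C_S non-singular)."""
    S = list(S)
    if not S:
        return 0.0
    return float(b[S] @ np.linalg.solve(C[np.ix_(S, S)], b[S]))

def submodularity_ratio(C, b, U, k, tol=1e-12):
    """Brute-force gamma_{U,k} for the R^2 objective."""
    n = C.shape[0]
    U = list(U)
    best = np.inf
    for r in range(len(U) + 1):
        for L in itertools.combinations(U, r):
            base = r_squared(C, b, L)
            rest = [i for i in range(n) if i not in L]
            for size in range(1, k + 1):
                for S in itertools.combinations(rest, size):
                    joint_gain = r_squared(C, b, L + S) - base
                    if joint_gain <= tol:
                        continue  # ratio is undefined when S adds nothing to L
                    single_gains = sum(r_squared(C, b, L + (x,)) - base for x in S)
                    best = min(best, single_gains / joint_gain)
    return float(best)

# Made-up instance with a suppressor: X_2 is uncorrelated with Z
# but correlated with X_1, so it helps only jointly with X_1.
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])
b = np.array([0.6, 0.0])

gamma = submodularity_ratio(C, b, U=[], k=2)
lam_min = float(np.linalg.eigvalsh(C)[0])
print(gamma, lam_min)  # 0.75 and 0.5, up to rounding
```

Here the individual gains sum to 0.36 while the joint gain is 0.48, giving a ratio of 0.75, which is indeed above the smallest eigenvalue 0.5 as Lemma 2.4 requires.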

Before proving Lemma 2.4, we first introduce two lemmas that relate the eigenvalues of normalized
covariance matrices with those of its submatrices.

Lemma 2.5 Let C be the covariance matrix of n zero-mean random variables $X_1, X_2, \ldots, X_n$, each of
which has variance at most 1. Let $C_\rho$ be the corresponding correlation matrix of the n random variables,
that is, $C_\rho$ is the covariance matrix of the variables after they are normalized to have unit variance. Then
$\lambda_{\min}(C) \le \lambda_{\min}(C_\rho)$.

Proof. Since $C_\rho$ is obtained by normalizing the variables such that they have unit variance, we get $C_\rho =
D^T C D$, where D is a diagonal matrix with diagonal entries $d_{i,i} = \frac{1}{\sqrt{\mathrm{Var}(X_i)}}$.

Since both $C_\rho$ and C are positive semidefinite, we can perform Cholesky factorization to get
lower-triangular matrices $A_\rho$ and A such that $C = A A^T$ and $C_\rho = A_\rho A_\rho^T$. Hence $A_\rho = D^T A$.

Let $\sigma_{\min}(A)$ and $\sigma_{\min}(A_\rho)$ denote the smallest singular values of A and $A_\rho$, respectively. Also, let v be
the singular vector corresponding to $\sigma_{\min}(A_\rho)$. Then,

$$\|A v\|_2 = \|D^{-1} A_\rho v\|_2 \le \|D^{-1}\|_2 \|A_\rho v\|_2 = \sigma_{\min}(A_\rho) \|D^{-1}\|_2 \le \sigma_{\min}(A_\rho),$$

where the last inequality follows since

$$\|D^{-1}\|_2 = \max_i \frac{1}{d_{i,i}} = \max_i \sqrt{\mathrm{Var}(X_i)} \le 1.$$

Hence, by the Courant-Fischer theorem, $\sigma_{\min}(A) \le \sigma_{\min}(A_\rho)$, and consequently, $\lambda_{\min}(C) \le \lambda_{\min}(C_\rho)$.

■

Lemma 2.6 Let $\lambda_{\min}(C)$ be the smallest eigenvalue of the covariance matrix C of n random variables
$X_1, X_2, \ldots, X_n$, and $\lambda_{\min}(C')$ be the smallest eigenvalue of the $(n - 1) \times (n - 1)$ covariance matrix
$C'$ corresponding to the $n - 1$ random variables $\mathrm{Res}(X_1, X_n), \ldots, \mathrm{Res}(X_{n-1}, X_n)$. Then $\lambda_{\min}(C) \le
\lambda_{\min}(C')$.
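Proof. Let $\lambda_i$ and $\lambda'_i$ denote the eigenvalues of C and $C'$ respectively. Also, let $c'_{i,j}$ denote the entries of
$C'$. Using the definition of the residual, we get that

$$c'_{i,j} = \mathrm{Cov}(\mathrm{Res}(X_i, X_n), \mathrm{Res}(X_j, X_n)) = c_{i,j} - \frac{c_{i,n} c_{j,n}}{c_{n,n}},$$
$$c'_{i,i} = \mathrm{Var}(\mathrm{Res}(X_i, X_n)) = c_{i,i} - \frac{c_{i,n}^2}{c_{n,n}}.$$

Defining $D = \frac{1}{c_{n,n}} \cdot [c_{1,n}, c_{2,n}, \ldots, c_{n-1,n}]^T \cdot [c_{1,n}, c_{2,n}, \ldots, c_{n-1,n}]$, we can write $C_{\{1,\ldots,n-1\}} = C' + D$.
To prove $\lambda_1 \le \lambda'_1$, let $e' = [e'_1, \ldots, e'_{n-1}]^T$ be the eigenvector of $C'$ corresponding to the eigenvalue $\lambda'_1$,
and consider the vector $e = [e'_1, e'_2, \ldots, e'_{n-1}, -\frac{1}{c_{n,n}} \sum_{i=1}^{n-1} e'_i c_{i,n}]^T$. Then, $C \cdot e = \begin{bmatrix} y \\ 0 \end{bmatrix}$, where

$$y = -\frac{1}{c_{n,n}} \Bigl(\sum_{i=1}^{n-1} e'_i c_{i,n}\Bigr) \cdot [c_{1,n}, c_{2,n}, \ldots, c_{n-1,n}]^T + C_{\{1,\ldots,n-1\}} \cdot e'$$
$$= -D \cdot e' + D \cdot e' + C' \cdot e'$$
$$= C' \cdot e'.$$

Thus, $\|C \cdot e\|_2 = \|[\lambda'_1 e'_1, \lambda'_1 e'_2, \ldots, \lambda'_1 e'_{n-1}, 0]^T\|_2 = \lambda'_1 \|[e'_1, e'_2, \ldots, e'_{n-1}, 0]^T\|_2 \le \lambda'_1 \|e\|_2$, which by
Rayleigh-Ritz bounds implies that $\lambda_1 \le \lambda'_1$. ■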

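Using the above two lemmas, we now prove Lemma 2.4.

Proof of Lemma 2.4. Since

$$\frac{(b^L_S)^T (C^L_S)^{-1} b^L_S}{(b^L_S)^T b^L_S} \le \max_x \frac{x^T (C^L_S)^{-1} x}{x^T x} = \lambda_{\max}\bigl((C^L_S)^{-1}\bigr) = \frac{1}{\lambda_{\min}(C^L_S)},$$

we can use Definition 2.3 to obtain that

$$\gamma_{U,k} \ge \min_{L \subseteq U,\; S: |S| \le k,\; S \cap L = \emptyset} \lambda_{\min}(C^L_S).$$

Next, we relate $\lambda_{\min}(C^L_S)$ with $\lambda_{\min}(C_{L \cup S})$, using repeated applications of Lemmas 2.5 and 2.6. Let
$L = \{X_1, \ldots, X_\ell\}$; for each i, define $L_i = \{X_1, \ldots, X_i\}$, and let $C^{(i)}$ be the covariance matrix of the
random variables $\{\mathrm{Res}(X, L \setminus L_i) \mid X \in S \cup L_i\}$, and $C^{(i)}_\rho$ the covariance matrix after normalizing all its
variables to unit variance. Then, Lemma 2.5 implies that for each i, $\lambda_{\min}(C^{(i)}) \le \lambda_{\min}(C^{(i)}_\rho)$, and Lemma
2.6 shows that $\lambda_{\min}(C^{(i)}_\rho) \le \lambda_{\min}(C^{(i-1)})$ for each $i > 0$. Combining these inequalities inductively for all
i, we obtain that

$$\lambda_{\min}(C^L_S) = \lambda_{\min}(C^{(0)}_\rho) \ge \lambda_{\min}(C^{(\ell)}_\rho) = \lambda_{\min}(C_{L \cup S}) \ge \lambda_{\min}(C, |L \cup S|).$$

Finally, since $|S| \le k$ and $L \subseteq U$, we obtain $\gamma_{U,k} \ge \lambda_{\min}(C, k + |U|)$. ■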



3 Algorithms Analysis 

We now present theoretical performance bounds for Forward Regression and Orthogonal Matching Pursuit, 
which are widely used in practice. We also analyze the Oblivious algorithm: one of the simplest greedy 
algorithms for subset selection. Throughout this section, we use $\mathrm{OPT} = \max_{S: |S| = k} R^2_{Z,S}$ to denote the
optimum value achievable by any set of size k.
3.1 Forward Regression

We first provide approximation bounds for Forward Regression, which is the standard algorithm used by
many researchers in medical, social, and economic domains.^4

Definition 3.1 (Forward Regression) The Forward Regression (also called Forward Selection) algorithm
for subset selection selects a set S of size k iteratively as follows:

1: Initialize $S_0 = \emptyset$.

2: for each iteration i + 1 do

3: Let $X_m$ be a variable maximizing $R^2_{Z, S_i \cup \{X_m\}}$, and set $S_{i+1} = S_i \cup \{X_m\}$.

4: Output $S_k$.
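The steps of Forward Regression can be sketched in a few lines of NumPy. This is a minimal illustration operating directly on the covariance matrix C and vector b, not an optimized implementation; the function names, the tie-breaking rule (first maximizer wins) and the example data are assumptions of this sketch.

```python
import numpy as np

def r_squared(C, b, S):
    """R^2_{Z,S} = b_S^T C_S^{-1} b_S (normalized variables, C_S non-singular)."""
    S = list(S)
    if not S:
        return 0.0
    return float(b[S] @ np.linalg.solve(C[np.ix_(S, S)], b[S]))

def forward_regression(C, b, k):
    """Greedily add the variable that maximizes R^2 of the enlarged set."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(len(b)) if i not in selected]
        # Ties are broken in favor of the smallest index (an arbitrary choice).
        best = max(candidates, key=lambda i: r_squared(C, b, selected + [i]))
        selected.append(best)
    return selected

# Made-up instance: X_1 and X_2 are correlated, X_3 is independent of both.
C = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
b = np.array([0.6, 0.3, 0.4])

S = forward_regression(C, b, 2)
print(S, r_squared(C, b, S))  # [0, 2] with R^2 = 0.52, up to rounding
```

After the first variable is chosen, the second one contributes nothing new because its covariance with Z is fully explained by the first, so the algorithm picks the third variable instead.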

Our main result is the following theorem. 
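Theorem 3.2 The set $S^{FR}$ selected by forward regression has the following approximation guarantees:

$$R^2_{Z,S^{FR}} \ge (1 - e^{-\gamma_{S^{FR},k}}) \cdot \mathrm{OPT}$$

$$\ge (1 - e^{-\lambda_{\min}(C,2k)}) \cdot \mathrm{OPT}$$

$$\ge (1 - e^{-\lambda_{\min}(C,k)}) \cdot \Theta\bigl((\tfrac{1}{2})^{1/\lambda_{\min}(C,k)}\bigr) \cdot \mathrm{OPT}.$$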

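Before proving the theorem, we first begin with a general lemma that bounds the amount by which the
$R^2$ value of a set and the sum of $R^2$ values of its elements can differ.

Lemma 3.3 $\frac{1}{\lambda_{\max}(C)} \sum_{i=1}^{n} R^2_{Z,X_i} \le R^2_{Z,\{X_1,\ldots,X_n\}} \le \frac{1}{\gamma_{\emptyset,n}} \sum_{i=1}^{n} R^2_{Z,X_i} \le \frac{1}{\lambda_{\min}(C)} \sum_{i=1}^{n} R^2_{Z,X_i}$.

Proof. Let the eigenvalues of $C^{-1}$ be $\lambda'_1 \le \lambda'_2 \le \cdots \le \lambda'_n$, with corresponding orthonormal eigenvectors
$e_1, e_2, \ldots, e_n$. We write b in the basis $\{e_1, e_2, \ldots, e_n\}$ as $b = \sum_i \beta_i e_i$. Then,

$$R^2_{Z,\{X_1,\ldots,X_n\}} = b^T C^{-1} b = \sum_i \beta_i^2 \lambda'_i.$$

Because $\lambda'_1 \le \lambda'_i$ for all i, we get $\lambda'_1 \sum_i \beta_i^2 \le R^2_{Z,\{X_1,\ldots,X_n\}}$, and $\sum_i \beta_i^2 = b^T b = \sum_i R^2_{Z,X_i}$ because the
length of the vector b is independent of the basis it is written in. Also, by definition of the submodularity
ratio, $R^2_{Z,\{X_1,\ldots,X_n\}} \le \frac{1}{\gamma_{\emptyset,n}} \sum_{i=1}^{n} R^2_{Z,X_i}$. Finally, because $\lambda'_1 = \frac{1}{\lambda_{\max}(C)}$, and using Lemma 2.4, we obtain the result. ■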

The next lemma relates the optimal $R^2$ value using k elements to the optimal $R^2$ using $k' < k$ elements.

Lemma 3.4 For each k, let $S^*_k \in \mathrm{argmax}_{|S| \le k} R^2_{Z,S}$ be an optimal subset of at most k variables. Then, for
any $k' = \Theta(k)$ such that $\frac{1}{\lambda_{\min}(C,k)} < k' < k$, we have that $R^2_{Z,S^*_{k'}} \ge R^2_{Z,S^*_k} \cdot \Theta\bigl((\frac{k'}{k})^{1/\lambda_{\min}(C,k)}\bigr)$, for large
enough k. In particular, $R^2_{Z,S^*_{k/2}} \ge R^2_{Z,S^*_k} \cdot \Theta\bigl((\frac{1}{2})^{1/\lambda_{\min}(C,k)}\bigr)$, for large enough k.

^4 There is some inconsistency in the literature about naming of greedy algorithms. Forward Regression is sometimes also referred
to as Orthogonal Matching Pursuit (OMP). We choose the nomenclature consistent with [10] and [14].
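Proof. We first prove that $R^2_{Z,S^*_{k-1}} \ge \bigl(1 - \frac{1}{k \lambda_{\min}(C,k)}\bigr) R^2_{Z,S^*_k}$. Let $T = \mathrm{Res}(Z, S^*_k)$; then, $\mathrm{Cov}(X_i, T) = 0$
for all $X_i \in S^*_k$, and $Z = T + \sum_{X_i \in S^*_k} \alpha_i X_i$, where $\alpha = (\alpha_i) = C_{S^*_k}^{-1} \cdot b_{S^*_k}$ are the optimal regression
coefficients. We write $Z' = Z - T$. For any $X_j \in S^*_k$, by definition of $R^2$, we have that

$$R^2_{Z', S^*_k \setminus \{X_j\}} \ge 1 - \frac{\alpha_j^2 \mathrm{Var}(X_j)}{\mathrm{Var}(Z')} = 1 - \frac{\alpha_j^2}{\mathrm{Var}(Z')};$$

in particular, this implies that $R^2_{Z',S^*_{k-1}} \ge 1 - \frac{\alpha_j^2}{\mathrm{Var}(Z')}$ for all $X_j \in S^*_k$.

Focus now on j minimizing $\alpha_j^2$, so that $\alpha_j^2 \le \frac{\|\alpha\|_2^2}{k}$. As in the proof of Lemma 3.3, by writing $\alpha$ in terms
of an orthonormal eigenbasis of $C_{S^*_k}$, one can show that $|\alpha^T C_{S^*_k} \alpha| \ge \|\alpha\|_2^2 \lambda_{\min}(C_{S^*_k})$, or $\|\alpha\|_2^2 \le \frac{|\alpha^T C_{S^*_k} \alpha|}{\lambda_{\min}(C_{S^*_k})}$.

Furthermore, $\alpha^T C_{S^*_k} \alpha = \mathrm{Var}\bigl(\sum_{X_j \in S^*_k} \alpha_j X_j\bigr) = \mathrm{Var}(Z')$, so $R^2_{Z',S^*_{k-1}} \ge 1 - \frac{1}{k \lambda_{\min}(C_{S^*_k})}$. Finally, by
definition, $R^2_{Z',S^*_k} = 1$, so

$$\frac{R^2_{Z,S^*_{k-1}}}{R^2_{Z,S^*_k}} \ge \frac{R^2_{Z',S^*_{k-1}}}{R^2_{Z',S^*_k}} \ge 1 - \frac{1}{k \lambda_{\min}(C_{S^*_k})} \ge 1 - \frac{1}{k \lambda_{\min}(C,k)}.$$

Now, applying this inequality repeatedly, we get

$$R^2_{Z,S^*_{k'}} \ge R^2_{Z,S^*_k} \cdot \prod_{i=k'+1}^{k} \Bigl(1 - \frac{1}{i \lambda_{\min}(C,i)}\Bigr).$$

Let $t = \lceil 1/\lambda_{\min}(C,k) \rceil$, so that the previous bound implies $R^2_{Z,S^*_{k'}} \ge R^2_{Z,S^*_k} \cdot \prod_{i=k'+1}^{k} \frac{i-t}{i}$. Most of the
terms in the product telescope, giving us a bound of $R^2_{Z,S^*_k} \cdot \prod_{i=1}^{t} \frac{k'-t+i}{k-t+i}$. Since $\prod_{i=1}^{t} \frac{k'-t+i}{k-t+i}$ converges to
$(\frac{k'}{k})^t$ with increasing k (keeping t constant), we get that for large k,

$$R^2_{Z,S^*_{k'}} \ge R^2_{Z,S^*_k} \cdot \Theta\bigl((\tfrac{k'}{k})^{1/\lambda_{\min}(C,k)}\bigr).$$

■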

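Using the above lemmas, we now prove the main theorem.

Proof of Theorem 3.2. We begin by proving the first inequality. Let $S^*_k$ be the optimum set of variables.
Let $S^G_i$ be the set of variables chosen by Forward Regression in the first i iterations (so that $S^{FR} = S^G_k$), and $S_i = S^*_k \setminus S^G_i$. By
monotonicity of $R^2$ and the fact that $S_i \cup S^G_i \supseteq S^*_k$, we have that $R^2_{Z,S_i \cup S^G_i} \ge \mathrm{OPT}$.

For each $X_j \in S_i$, let $X'_j = \mathrm{Res}(X_j, S^G_i)$ be the residual of $X_j$ conditioned on $S^G_i$, and write $S'_i =
\{X'_j \mid X_j \in S_i\}$.

We will show that at least one of the $X'_j$ is a good candidate in iteration i + 1 of Forward Regression.
First, the joint contribution of $S'_i$ must be fairly large: $R^2_{Z,\mathrm{Res}(S_i, S^G_i)} = R^2_{Z,S'_i} \ge \mathrm{OPT} - R^2_{Z,S^G_i}$. Using
Definition 2.3, as well as $S^G_i \subseteq S^G_k$ and $|S_i| \le k$,

$$\sum_{X'_j \in S'_i} R^2_{Z,X'_j} \ge \gamma_{S^G_i,|S_i|} \cdot R^2_{Z,S'_i} \ge \gamma_{S^G_k,k} \cdot R^2_{Z,S'_i}.$$

Let $\ell$ maximize $R^2_{Z,X'_\ell}$, i.e., $\ell \in \mathrm{argmax}_{(j: X'_j \in S'_i)} R^2_{Z,X'_j}$. Then we get that

$$R^2_{Z,X'_\ell} \ge \frac{\gamma_{S^G_k,k}}{|S'_i|} \cdot R^2_{Z,S'_i} \ge \frac{\gamma_{S^G_k,k}}{k} \cdot R^2_{Z,S'_i}.$$

Define $A(i) = R^2_{Z,S^G_i} - R^2_{Z,S^G_{i-1}}$ to be the gain obtained from the variable chosen by Forward Regression
in iteration i. Then $R^2_{Z,S^G_k} = \sum_{i=1}^{k} A(i)$. Since the $X'_\ell$ above was a candidate to be chosen in iteration i + 1,
and Forward Regression chose a variable $X_m$ such that $R^2_{Z,\mathrm{Res}(X_m,S^G_i)} \ge R^2_{Z,\mathrm{Res}(X,S^G_i)}$ for all $X \notin S^G_i$, we
obtain that

$$A(i+1) \ge \frac{\gamma_{S^G_k,k}}{k} R^2_{Z,S'_i} \ge \frac{\gamma_{S^G_k,k}}{k} \bigl(\mathrm{OPT} - R^2_{Z,S^G_i}\bigr) \ge \frac{\gamma_{S^G_k,k}}{k} \Bigl(\mathrm{OPT} - \sum_{j=1}^{i} A(j)\Bigr).$$

Since the above inequality holds for each iteration $i = 1, 2, \ldots, k$, a simple inductive proof establishes
the bound $\mathrm{OPT} - \sum_{i=1}^{k} A(i) \le \mathrm{OPT} \cdot \bigl(1 - \frac{\gamma_{S^G_k,k}}{k}\bigr)^k$. Hence,

$$R^2_{Z,S^{FR}} = \sum_{i=1}^{k} A(i) \ge \mathrm{OPT} - \mathrm{OPT} \Bigl(1 - \frac{\gamma_{S^{FR},k}}{k}\Bigr)^k \ge \mathrm{OPT} \cdot (1 - e^{-\gamma_{S^{FR},k}}).$$

The second inequality follows directly from Lemma 2.4 and the fact that $|S^{FR}| = k$. By applying the
above result after k/2 iterations, we obtain $R^2_{Z,S^G_{k/2}} \ge (1 - e^{-\lambda_{\min}(C,k)}) \cdot R^2_{Z,S^*_{k/2}}$. Now, using Lemma 3.4
and monotonicity of $R^2$, we get

$$R^2_{Z,S^G_k} \ge R^2_{Z,S^G_{k/2}} \ge (1 - e^{-\lambda_{\min}(C,k)}) \cdot \Theta\bigl((\tfrac{1}{2})^{1/\lambda_{\min}(C,k)}\bigr) \cdot \mathrm{OPT},$$

proving the third inequality. ■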
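The inductive step used for the first inequality of Theorem 3.2 can be written out explicitly. The shorthand $\gamma$ for the submodularity ratio appearing in the proof and the quantity $\Delta_i$ are introduced here only for illustration; the derivation assumes $0 \le \gamma/k \le 1$ and uses the elementary inequality $1 - x \le e^{-x}$.

```latex
\begin{align*}
\Delta_i &:= \mathrm{OPT} - \sum_{j=1}^{i} A(j), \qquad \Delta_0 = \mathrm{OPT}, \\
\Delta_{i+1} &= \Delta_i - A(i+1)
   \;\le\; \Delta_i - \frac{\gamma}{k}\,\Delta_i
   \;=\; \Bigl(1 - \frac{\gamma}{k}\Bigr)\Delta_i, \\
\Delta_k &\le \Bigl(1 - \frac{\gamma}{k}\Bigr)^{k}\,\mathrm{OPT}
   \;\le\; e^{-\gamma}\,\mathrm{OPT}, \\
\sum_{i=1}^{k} A(i) &= \mathrm{OPT} - \Delta_k \;\ge\; (1 - e^{-\gamma})\cdot \mathrm{OPT}.
\end{align*}
```

The same recursion, with $\gamma$ replaced by the product of the submodularity ratio and $\lambda_{\min}(C, 2k)$, yields the corresponding bound for Orthogonal Matching Pursuit.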
3.2 Orthogonal Matching Pursuit 

The second greedy algorithm we analyze is Orthogonal Matching Pursuit (OMP), frequently used in signal 
processing domains. 

Definition 3.5 (Orthogonal Matching Pursuit (OMP)) The Orthogonal Matching Pursuit algorithm for
subset selection selects a set S of size k iteratively as follows:

1: Initialize $S_0 = \emptyset$.

2: for each iteration i + 1 do

3: Let $X_m$ be a variable maximizing $|\mathrm{Cov}(\mathrm{Res}(Z, S_i), X_m)|$, and set $S_{i+1} = S_i \cup \{X_m\}$.

4: Output $S_k$.
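A covariance-based sketch of OMP is given below. It uses the identity $\mathrm{Cov}(\mathrm{Res}(Z, S), X_m) = b_m - C_{m,S} C_S^{-1} b_S$, which follows from the definition of the residual with the optimal regression coefficients. The function name, the explicit exclusion of already selected variables, and the example data are assumptions of this illustration.

```python
import numpy as np

def orthogonal_matching_pursuit(C, b, k):
    """Greedily add the variable most correlated with the current residual of Z."""
    selected = []
    for _ in range(k):
        resid_cov = b.astype(float).copy()
        if selected:
            # Regression coefficients of Z on the selected variables.
            alpha = np.linalg.solve(C[np.ix_(selected, selected)], b[selected])
            # Cov(Res(Z, S), X_m) = b_m - C[m, S] @ alpha for every m.
            resid_cov -= C[:, selected] @ alpha
        score = np.abs(resid_cov)
        for i in selected:
            score[i] = -np.inf  # never pick the same variable twice
        selected.append(int(np.argmax(score)))
    return selected

# Made-up instance: X_1 and X_2 are correlated, X_3 is independent of both.
C = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
b = np.array([0.6, 0.3, 0.4])

print(orthogonal_matching_pursuit(C, b, 2))  # [0, 2]
```

Unlike Forward Regression, each iteration needs only one linear solve, since the candidates are scored by their covariance with the residual rather than by the $R^2$ of each enlarged set.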

By applying similar techniques as in the previous section, we can also obtain approximation bounds for 
OMP. We start by proving the following lemma that lower-bounds the variance of the residual of a variable. 

Lemma 3.6 Let A be the $(n+1) \times (n+1)$ covariance matrix of the normalized variables $Z, X_1, X_2, \ldots, X_n$.
Then $\mathrm{Var}(\mathrm{Res}(Z, \{X_1, \ldots, X_n\})) \ge \lambda_{\min}(A)$.
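Proof. The matrix A is of the form $A = \begin{pmatrix} 1 & b^T \\ b & C \end{pmatrix}$. We use $A[i,j]$ to denote the matrix obtained by
removing the $i$-th row and $j$-th column of A, and similarly for C. Recalling that the $(i,j)$ entry of $C^{-1}$ is
$\frac{(-1)^{i+j} \det(C[i,j])}{\det(C)}$, and developing the determinant of A by the first row and column, we can write

$$\det(A) = \sum_{j=1}^{n+1} (-1)^{1+j} a_{1,j} \det(A[1,j])$$

$$= \det(C) + \sum_{j=1}^{n} (-1)^{j} b_j \det(A[1,j+1])$$

$$= \det(C) + \sum_{j=1}^{n} (-1)^{j} b_j \sum_{i=1}^{n} (-1)^{i+1} b_i \det(C[i,j])$$

$$= \det(C) - \sum_{i=1}^{n} \sum_{j=1}^{n} (-1)^{i+j} b_i b_j \det(C[i,j])$$

$$= \det(C) (1 - b^T C^{-1} b).$$

Therefore, using that $\mathrm{Var}(Z) = 1$,

$$\mathrm{Var}(\mathrm{Res}(Z, \{X_1, \ldots, X_n\})) = \mathrm{Var}(Z) - b^T C^{-1} b = \frac{\det(A)}{\det(C)}.$$

Because $\det(A) = \prod_{i=1}^{n+1} \lambda^A_i$ and $\det(C) = \prod_{i=1}^{n} \lambda^C_i$, and $\lambda^A_1 \le \lambda^C_1 \le \lambda^A_2 \le \lambda^C_2 \le \ldots \le \lambda^A_{n+1}$ by the
eigenvalue interlacing theorem, we get that $\frac{\det(A)}{\det(C)} \ge \lambda^A_1$, proving the lemma. ■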
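The determinant identity and the bound of Lemma 3.6 can be checked numerically on a small instance. The sketch below is only a sanity check on made-up numbers, not a substitute for the proof; all variable names are assumptions of this illustration.

```python
import numpy as np

# Made-up normalized instance with two predictors.
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])
b = np.array([0.6, 0.0])

# Assemble A = [[1, b^T], [b, C]], the covariance matrix of (Z, X_1, X_2).
A = np.block([[np.ones((1, 1)), b[None, :]],
              [b[:, None], C]])

# Residual variance of Z after regressing on all X_i: Var(Z) - b^T C^{-1} b.
resid_var = 1.0 - float(b @ np.linalg.solve(C, b))

# The same quantity via the determinant ratio from the proof.
det_ratio = float(np.linalg.det(A) / np.linalg.det(C))

# Smallest eigenvalue of A (eigvalsh returns ascending eigenvalues).
lam_min_A = float(np.linalg.eigvalsh(A)[0])

print(resid_var, det_ratio, lam_min_A)
```

For these numbers both expressions for the residual variance equal 0.52 up to rounding, while the smallest eigenvalue of A is roughly 0.22, consistent with the lemma.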



The above lemma, along with an analysis similar to the proof of Theorem 3.2, can be used to prove the
following approximation bounds for OMP:

Theorem 3.7 The set $S^{OMP}$ selected by orthogonal matching pursuit has the following approximation
guarantees:

$$R^2_{Z,S^{OMP}} \ge (1 - e^{-(\gamma_{S^{OMP},k} \cdot \lambda_{\min}(C,2k))}) \cdot \mathrm{OPT}$$

$$\ge (1 - e^{-\lambda_{\min}(C,2k)^2}) \cdot \mathrm{OPT}$$

$$\ge (1 - e^{-\lambda_{\min}(C,k)^2}) \cdot \Theta\bigl((\tfrac{1}{2})^{1/\lambda_{\min}(C,k)}\bigr) \cdot \mathrm{OPT}.$$

Proof. We begin by proving the first inequality. Using notation similar to that in the proof of Theorem 13.21 
we let 5^ be the optimum set of k variables, Sf^ the set of variables chosen by OMP in the first i iterations, 
and Si = Sl\ Sp. For each Xj G Si, let Xj = Res{Xj, ) be the residual of Xj conditioned on 5f , and 
write S'i = {X'j I Xj G S}. 

Consider some iteration i + 1 of OMP. We will show that at least one of the X'- is a good candidate in 
this iteration. Let £ maximize R\ x'^ ^ ^ ^-^S^^^QiX'sS') R^z x' - Lemma 3.7, 

Var(X;) > XminiCs^^ U{X'}) ^ Xmm{C,2k). 
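The OMP algorithm chooses a variable $X_m$ to add which maximizes $|\mathrm{Cov}(\mathrm{Res}(Z, S_i^G), X_m)|$.
Thus, $X_m$ maximizes

\[
\mathrm{Cov}(\mathrm{Res}(Z, S_i^G), X_m)^2 = \mathrm{Cov}(Z, \mathrm{Res}(X_m, S_i^G))^2
= R^2_{Z,\mathrm{Res}(X_m, S_i^G)} \cdot \mathrm{Var}(\mathrm{Res}(X_m, S_i^G)).
\]

In particular, this implies

\[
R^2_{Z,\mathrm{Res}(X_m, S_i^G)}
\ge R^2_{Z,X'_\ell} \cdot \frac{\mathrm{Var}(X'_\ell)}{\mathrm{Var}(\mathrm{Res}(X_m, S_i^G))}
\ge R^2_{Z,X'_\ell} \cdot \frac{\lambda_{\min}(C, 2k)}{\mathrm{Var}(\mathrm{Res}(X_m, S_i^G))}
\ge R^2_{Z,X'_\ell} \cdot \lambda_{\min}(C, 2k),
\]

because $\mathrm{Var}(\mathrm{Res}(X_m, S_i^G)) \le 1$. As in the proof of Theorem 3.2,
$R^2_{Z,X'_\ell} \ge \frac{\gamma_{S^{OMP},k}}{k} \cdot R^2_{Z,S'_i}$, so
$R^2_{Z,\mathrm{Res}(X_m, S_i^G)} \ge R^2_{Z,S'_i} \cdot \frac{\lambda_{\min}(C,2k) \cdot \gamma_{S^{OMP},k}}{k}$.
With the same definition of $A(i)$ as in the previous proof, we get that
$A(i+1) \ge \frac{\lambda_{\min}(C,2k) \cdot \gamma_{S^{OMP},k}}{k} \cdot (OPT - \sum_{j=1}^{i} A(j))$.
An inductive proof now shows that

\[
R^2_{Z,S^{OMP}} = \sum_{i=1}^{k} A(i) \ge (1 - e^{-\lambda_{\min}(C,2k) \cdot \gamma_{S^{OMP},k}}) \cdot OPT.
\]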

The proofs of the other two inequalities follow the same pattern as the proof for Forward Regression. ■ 
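The selection rule analyzed in this proof can be sketched directly on the normalized pair (C, b). The following is a minimal illustration assuming NumPy; the helper names `r_squared` and `omp`, and the random instance, are hypothetical and not taken from the paper.

```python
import itertools
import numpy as np

def r_squared(C, b, S):
    """R^2 of the best linear predictor of Z from the variables indexed by S."""
    S = list(S)
    return float(b[S] @ np.linalg.solve(C[np.ix_(S, S)], b[S])) if S else 0.0

def omp(C, b, k):
    """Each round adds the variable with the largest |Cov(Res(Z, S), X_j)|."""
    S = []
    for _ in range(k):
        resid_cov = b.copy()
        if S:
            coef = np.linalg.solve(C[np.ix_(S, S)], b[S])
            resid_cov = b - C[:, S] @ coef  # Cov(Res(Z, S), X_j) for every j
            resid_cov[S] = 0.0              # selected variables are excluded
        S.append(int(np.argmax(np.abs(resid_cov))))
    return S

# Hypothetical instance: unit-variance variables, Z stored in column 0.
rng = np.random.default_rng(1)
data = rng.standard_normal((300, 7))
data[:, 0] = data[:, 1:] @ rng.uniform(0.0, 1.0, size=6) + rng.standard_normal(300)
A = np.corrcoef(data, rowvar=False)
C, b = A[1:, 1:], A[0, 1:]

k = 3
S_omp = omp(C, b, k)
opt = max(r_squared(C, b, S) for S in itertools.combinations(range(6), k))
lam = np.linalg.eigvalsh(C)[0]  # lambda_min(C), a lower bound on lambda_min(C, 2k)
print(S_omp, r_squared(C, b, S_omp), opt, (1 - np.exp(-lam ** 2)) * opt)
```

Because $\lambda_{\min}(C) \le \lambda_{\min}(C, 2k)$, the last printed value is a weaker form of the second bound of Theorem 3.7, so the OMP value should lie between it and OPT.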
3.3 Oblivious Algorithm 

As a baseline, we also consider a greedy algorithm which completely ignores C and simply selects the k 
variables individually most correlated with Z. 

Definition 3.8 (Oblivious) The Oblivious algorithm for subset selection is as follows: Select the k variables
$X_i$ with the largest $b_i$ values.
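Definition 3.8 amounts to a single sort. The sketch below is an illustration only; it assumes the ranking is by $b_i^2$, which is the quantity compared in the proof of Theorem 3.9 and coincides with ranking by $b_i$ when all covariances are non-negative. The function name `oblivious` is hypothetical.

```python
def oblivious(b, k):
    """Indices of the k variables with the largest squared covariance with Z."""
    order = sorted(range(len(b)), key=lambda i: b[i] ** 2, reverse=True)
    return sorted(order[:k])

print(oblivious([0.1, -0.7, 0.5, 0.3], 2))  # [1, 2]
```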

Lemma 3.3 immediately implies a simple bound for the Oblivious algorithm:

Theorem 3.9 The set $S^{OBL}$ selected by the Oblivious algorithm has the following approximation
guarantees:

\[
R^2_{Z,S^{OBL}} \ge \frac{\gamma_{\emptyset,k}}{\lambda_{\max}(C,k)} \cdot OPT
\ge \frac{\lambda_{\min}(C,k)}{\lambda_{\max}(C,k)} \cdot OPT.
\]



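Proof. Let S be the set chosen by the Oblivious algorithm, and $S_k^*$ the optimum set of k variables. By
definition of the Oblivious algorithm, $\sum_{i \in S} b_i^2 \ge \sum_{i \in S_k^*} b_i^2$, so using Lemma 3.3,
we obtain that

\[
R^2_{Z,S} \ge \frac{\sum_{i \in S} b_i^2}{\lambda_{\max}(C,k)}
\ge \frac{\sum_{i \in S_k^*} b_i^2}{\lambda_{\max}(C,k)}
\ge \frac{\gamma_{\emptyset,k}}{\lambda_{\max}(C,k)} \cdot R^2_{Z,S_k^*}.
\]

The second inequality of the theorem follows directly from Lemma 2.4. ■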



4 Dictionary Selection Bounds 

To demonstrate the wider applicability of the approximate submodularity framework, we next obtain a
tighter analysis for two greedy algorithms for the dictionary selection problem, introduced in [8].

4.1 The Algorithm SDS_MA

The SDS_MA algorithm generalizes the Oblivious greedy algorithm to the problem of dictionary selection. It
replaces the $R^2_{Z_j,S}$ term in Definition 2.2 with its modular approximation
$f(Z_j, S) = \sum_{i \in S} R^2_{Z_j,X_i}$. Thus, it greedily tries to maximize the function
$\hat{F}(D) = \sum_{j=1}^{s} \max_{S \subset D, |S| = k} f(Z_j, S)$, over sets D of size at
most d; the inner maximum can be computed efficiently using the Oblivious algorithm.

Definition 4.1 (SDS_MA) The SDS_MA algorithm for dictionary selection selects a dictionary D of size d
iteratively as follows:

1: Initialize $D_0 = \emptyset$.

2: for each iteration i + 1 do

3: Let $X_m$ be a variable maximizing $\hat{F}(D_i \cup \{X_m\})$, and set $D_{i+1} = D_i \cup \{X_m\}$.

4: Output $D_d$.
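The steps of Definition 4.1 can be sketched as follows. This is an illustrative pure-Python version, assuming the input is a hypothetical table `r2[j][i]` holding $R^2_{Z_j,X_i}$; since $f(Z_j, S)$ is modular, the inner maximum over sets of size k is simply the sum of the k largest entries available in D.

```python
def f_hat(r2, D, k):
    """Modular objective: for each target, sum its k largest r2 values within D."""
    return sum(sum(sorted((row[i] for i in D), reverse=True)[:k]) for row in r2)

def sds_ma(r2, d, k):
    """Greedily grow a dictionary of d indices that maximizes f_hat."""
    n = len(r2[0])
    D = []
    for _ in range(d):
        candidates = [i for i in range(n) if i not in D]
        D.append(max(candidates, key=lambda i: f_hat(r2, D + [i], k)))
    return D

# Hypothetical input: two targets, four candidate variables.
r2 = [[0.5, 0.4, 0.1, 0.0],
      [0.0, 0.1, 0.45, 0.3]]
D = sds_ma(r2, d=2, k=1)
print(D, f_hat(r2, D, 1))
```

In this small example, the first round picks the variable with the best total over both targets (index 2), and the second round adds the variable that best serves the first target (index 0).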

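Using Lemma 3.3, we can obtain the following multiplicative approximation guarantee for SDS_MA:

Theorem 4.2 Let $D^{MA}$ be the dictionary selected by the SDS_MA algorithm, and $D^*$ the optimum dictionary
of size $|D| \le d$, with respect to the objective F(D) from Definition 2.2. Then,

\[
F(D^{MA}) \ge \frac{\gamma_{\emptyset,k}}{\lambda_{\max}(C,k)} (1 - \frac{1}{e}) F(D^*)
\ge \frac{\lambda_{\min}(C,k)}{\lambda_{\max}(C,k)} (1 - \frac{1}{e}) F(D^*).
\]

Proof. Let $\hat{D}$ be a dictionary of size d maximizing $\hat{F}(D)$. Because $f(Z_j, S)$ is monotone and
modular in S, $\hat{F}$ is a monotone, submodular function. Hence, using the submodularity results of
Nemhauser et al. [12] and the optimality of $\hat{D}$ for $\hat{F}$,

\[
\hat{F}(D^{MA}) \ge \hat{F}(\hat{D}) (1 - \frac{1}{e}) \ge \hat{F}(D^*) (1 - \frac{1}{e}).
\]

Now, by applying Lemma 3.3 for each $Z_j$, it is easy to show that
$\hat{F}(D^*) \ge \gamma_{\emptyset,k} \cdot F(D^*)$, and similarly
$\hat{F}(D^{MA}) \le \lambda_{\max}(C, k) \cdot F(D^{MA})$. Thus we get
$F(D^{MA}) \ge \frac{\gamma_{\emptyset,k}}{\lambda_{\max}(C,k)} (1 - \frac{1}{e}) F(D^*)$.
The second part now follows from Lemma 2.4. ■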



Note that these bounds significantly improve the previous additive approximation guarantee obtained in [8]:
$F(D^{MA}) \ge (1 - \frac{1}{e}) F(D^*) - (2 - \frac{1}{e}) k \cdot \mu(C)$. In particular, when
$\mu(C) > \Theta(1/k)$, i.e., even just one pair of variables has moderate correlation, the approximation
guarantee of Krause and Cevher becomes trivial.



4.2 The Algorithm SDS_OMP

We also obtain a multiplicative approximation guarantee for the greedy SDS_OMP algorithm, introduced by
Krause and Cevher for dictionary selection. Our bounds for SDS_OMP are much stronger than the additive
bounds obtained by Krause and Cevher. However, for both our results and theirs, the performance guarantees
for SDS_OMP are much weaker than those for SDS_MA.

The SDS_OMP algorithm generalizes the Orthogonal Matching Pursuit algorithm for subset selection
to the problem of dictionary selection. In each iteration, it adds a new element to the currently selected
dictionary by using Orthogonal Matching Pursuit to approximate the estimation of $\max_{|S|=k} R^2_{Z_j,S}$.

Definition 4.3 (SDS_OMP) The SDS_OMP algorithm for dictionary selection selects a dictionary D of size d
iteratively as follows:

1: Initialize $D_0 = \emptyset$.

2: for each iteration i + 1 do

3: Let $X_m$ be a variable maximizing $\sum_{j=1}^{s} R^2_{Z_j, S_{OMP}(D_i \cup \{X_m\}, Z_j, k)}$, where
$S_{OMP}(D, Z, k)$ denotes the set selected by Orthogonal Matching Pursuit for predicting Z using k
variables from D.

4: Set $D_{i+1} = D_i \cup \{X_m\}$.

5: Output $D_d$.
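A compact sketch of Definition 4.3 is given below, assuming NumPy. The names `omp_r2` and `sds_omp`, the matrix `B` (whose row j holds the covariances of target $Z_j$ with all candidate variables), and the random instance are hypothetical choices made for illustration.

```python
import numpy as np

def omp_r2(C, b, D, k):
    """R^2 achieved by the set OMP selects from candidates D (at most k of them)."""
    S = []
    for _ in range(min(k, len(D))):
        if S:
            coef = np.linalg.solve(C[np.ix_(S, S)], b[S])
            score = {j: abs(b[j] - C[j, S] @ coef) for j in D if j not in S}
        else:
            score = {j: abs(b[j]) for j in D}
        S.append(max(score, key=score.get))
    return float(b[S] @ np.linalg.solve(C[np.ix_(S, S)], b[S]))

def sds_omp(C, B, d, k):
    """Greedily add the variable that most improves the summed OMP fit over all targets."""
    n = C.shape[0]
    D = []
    for _ in range(d):
        gain = {m: sum(omp_r2(C, bj, D + [m], k) for bj in B)
                for m in range(n) if m not in D}
        D.append(max(gain, key=gain.get))
    return D

# Hypothetical instance: two targets, each driven by two of six variables.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 6))
Zs = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])
Zs = Zs + 0.1 * rng.standard_normal((200, 2))
A = np.corrcoef(np.column_stack([Zs, X]), rowvar=False)
C, B = A[2:, 2:], A[:2, 2:]

D = sds_omp(C, B, d=3, k=2)
print(D)
```

Each outer iteration reruns OMP once per target and per candidate, which is why this algorithm is more expensive per step than SDS_MA.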

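We now show how to obtain a multiplicative approximation guarantee for SDS_OMP. The following
definitions are key to our analysis; the first two are from Definition 2.2 and Theorem 4.2.

\[
\begin{aligned}
F(D) & = \sum_{j=1}^{s} \max_{S \subset D, |S| = k} R^2_{Z_j,S}, \\
\hat{F}(D) & = \sum_{j=1}^{s} \max_{S \subset D, |S| = k} f(Z_j, S), \\
\tilde{F}(D) & = \sum_{j=1}^{s} R^2_{Z_j, S_{OMP}(D, Z_j, k)}.
\end{aligned}
\]

We first prove the following lemma about approximating the function $\hat{F}(D)$ by $\tilde{F}(D)$:

Lemma 4.4 For any set D, we have that

\[
\frac{1}{\lambda_{\max}(C,k)} (1 - e^{-\lambda_{\min}(C,2k)^2}) \cdot \hat{F}(D)
\le \tilde{F}(D) \le \frac{\hat{F}(D)}{\gamma_{\emptyset,k}}.
\]

Proof. Using Theorem 3.7 and Lemma 3.3 and summing up over all the $Z_j$ terms, we obtain that

\[
\tilde{F}(D) \ge (1 - e^{-\lambda_{\min}(C,2k)^2}) \cdot F(D)
\ge (1 - e^{-\lambda_{\min}(C,2k)^2}) \cdot \frac{\hat{F}(D)}{\lambda_{\max}(C,k)}.
\]

Similarly, using Lemma 3.3 and the fact that
$\max_{S \subset D, |S| = k} R^2_{Z_j,S} \ge R^2_{Z_j, S_{OMP}(D, Z_j, k)}$, we have

\[
\hat{F}(D) \ge \gamma_{\emptyset,k} \cdot F(D) \ge \gamma_{\emptyset,k} \cdot \tilde{F}(D).
\]

■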

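Using the above lemma, we now prove the following bound for SDS_OMP:

Theorem 4.5 Let $D^{OMP}$ be the dictionary selected by the SDS_OMP algorithm, and $D^*$ the optimum
dictionary of size $|D| \le d$, with respect to the objective F(D) from Definition 2.2. Then,

\[
F(D^{OMP}) \ge F(D^*) \cdot \frac{\gamma_{\emptyset,k}}{\lambda_{\max}(C,k)} \cdot
\frac{1 - e^{-(p \cdot \gamma_{\emptyset,k})}}{d - d \cdot p \cdot \gamma_{\emptyset,k} + 1}
\ge F(D^*) \cdot \frac{\lambda_{\min}(C,k)}{\lambda_{\max}(C,k)} \cdot
\frac{1 - e^{-(p \cdot \gamma_{\emptyset,k})}}{d - d \cdot p \cdot \gamma_{\emptyset,k} + 1},
\]

where $p = \frac{1}{\lambda_{\max}(C,k)} \cdot (1 - e^{-\lambda_{\min}(C,2k)^2})$.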
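Proof. Let $\hat{D}$ be the dictionary of size d that maximizes $\hat{F}(D)$. We first prove that
$\hat{F}(D^{OMP})$ is a good approximation to $\hat{F}(\hat{D})$.

Let $S_i^G$ be the variables chosen by SDS_OMP after i iterations. Define $S_i = \hat{D} \setminus S_i^G$.
By monotonicity of $\hat{F}$, we have that $\hat{F}(S_i \cup S_i^G) \ge \hat{F}(\hat{D})$.

Let $\hat{X} \in S_i$ be the variable maximizing $\hat{F}(S_i^G \cup \{X\})$, and similarly $\tilde{X} \in S_i$
be the variable maximizing $\tilde{F}(S_i^G \cup \{X\})$.

Since $\hat{F}$ is a submodular function, it is easy to show (using an argument similar to the proof of
Theorem 3.2) that
$\hat{F}(S_i^G \cup \{\hat{X}\}) - \hat{F}(S_i^G) \ge \frac{\hat{F}(\hat{D}) - \hat{F}(S_i^G)}{d}$.

Now, using Lemma 4.4 above, and the optimality of $\tilde{X}$ for $\tilde{F}(S_i^G \cup \{X\})$, we obtain that

\[
\frac{1}{\gamma_{\emptyset,k}} \hat{F}(S_i^G \cup \{\tilde{X}\}) \ge \tilde{F}(S_i^G \cup \{\tilde{X}\})
\ge \tilde{F}(S_i^G \cup \{\hat{X}\}) \ge p \cdot \hat{F}(S_i^G \cup \{\hat{X}\}).
\]

Thus, $\hat{F}(S_i^G \cup \{\tilde{X}\}) \ge p \cdot \gamma_{\emptyset,k} \cdot \hat{F}(S_i^G \cup \{\hat{X}\})$, or

\[
\hat{F}(S_i^G \cup \{\tilde{X}\}) - \hat{F}(S_i^G) \ge p \cdot \gamma_{\emptyset,k} \cdot
(\hat{F}(S_i^G \cup \{\hat{X}\}) - \hat{F}(S_i^G)) - (1 - p \cdot \gamma_{\emptyset,k}) \hat{F}(S_i^G).
\]

Define $A(i) = \hat{F}(S_i^G) - \hat{F}(S_{i-1}^G)$ to be the gain, with respect to $\hat{F}$, obtained from
the variable chosen by SDS_OMP in iteration i. Then $\hat{F}(D^{OMP}) = \sum_{i=1}^{d} A(i)$. From the
preceding paragraphs, we obtain

\[
A(i+1) \ge \frac{p \cdot \gamma_{\emptyset,k}}{d} \cdot
\Big( \hat{F}(\hat{D}) - \big(1 + \frac{d}{p \cdot \gamma_{\emptyset,k}} - d\big) \sum_{j=1}^{i} A(j) \Big).
\]

Since the above inequality holds for each iteration i = 1, 2, ..., d, a simple inductive proof shows that

\[
\sum_{i=1}^{d} A(i) \ge \hat{F}(\hat{D}) \cdot \Big(1 - \big(1 - \frac{p \cdot \gamma_{\emptyset,k}}{d}\big)^d\Big)
- (d - d \cdot p \cdot \gamma_{\emptyset,k}) \cdot \sum_{i=1}^{d} A(i).
\]

Rearranging the terms and simplifying, we get that

\[
\hat{F}(D^{OMP}) \ge \hat{F}(\hat{D}) \cdot \frac{1 - e^{-(p \cdot \gamma_{\emptyset,k})}}{d - d \cdot p \cdot \gamma_{\emptyset,k} + 1}
\ge \hat{F}(D^*) \cdot \frac{1 - e^{-(p \cdot \gamma_{\emptyset,k})}}{d - d \cdot p \cdot \gamma_{\emptyset,k} + 1},
\]

where the last inequality is due to the optimality of $\hat{D}$ for $\hat{F}$.

Now, using Lemma 3.3 for each $Z_j$ term, it can be easily seen that
$\hat{F}(D^*) \ge \gamma_{\emptyset,k} \cdot F(D^*)$. Similarly, using Lemma 3.3 on the set $D^{OMP}$, we have
$F(D^{OMP}) \ge \frac{1}{\lambda_{\max}(C,k)} \cdot \hat{F}(D^{OMP})$.

Using the above inequalities, we therefore get the desired bound

\[
F(D^{OMP}) \ge F(D^*) \cdot \frac{\gamma_{\emptyset,k}}{\lambda_{\max}(C,k)} \cdot
\frac{1 - e^{-(p \cdot \gamma_{\emptyset,k})}}{d - d \cdot p \cdot \gamma_{\emptyset,k} + 1}.
\]

The second inequality of the Theorem now follows directly from Lemma 2.4. ■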



5 Experiments 

In this section, we evaluate Forward Regression (FR) and OMP empirically, on two real-world and one
synthetic data set. We compare the two algorithms against an optimal solution (OPT), computed using
exhaustive search, the Oblivious greedy algorithm (OBL), and the L1-regularization/Lasso (L1) algorithm
(in the implementation of Koh et al. [7]). Beyond the algorithms' performance, we also compute the various
spectral parameters from which we can derive lower bounds. Specifically, these are

1. the submodularity ratio: $\gamma_{S^{FR},k}$, where $S^{FR}$ is the subset selected by forward regression.

2. the smallest sparse eigenvalues $\lambda_{\min}(C, k)$ and $\lambda_{\min}(C, 2k)$. (In some cases,
computing $\lambda_{\min}(C, 2k)$ was not computationally feasible due to the problem size.)

3. the sparse inverse condition number $\kappa(C, k)^{-1}$. As mentioned earlier, the sparse inverse condition
number $\kappa(C, k)^{-1}$ is strongly related to the Restricted Isometry Property in [1].

4. the smallest eigenvalue $\lambda_{\min}(C) = \lambda_{\min}(C, n)$ of the entire covariance matrix.
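For the small instances considered here, the sparse eigenvalues in items 2 to 4 can be obtained by enumerating all k x k principal submatrices. The sketch below assumes NumPy and a hypothetical random correlation matrix; it also assumes that the sparse inverse condition number is the ratio $\lambda_{\min}(C,k)/\lambda_{\max}(C,k)$, which should be checked against the definition given earlier in the paper.

```python
import itertools
import numpy as np

def sparse_eigenvalues(C, k):
    """Brute-force lambda_min(C, k) and lambda_max(C, k) over all k-subsets."""
    lo, hi = np.inf, -np.inf
    for S in itertools.combinations(range(C.shape[0]), k):
        w = np.linalg.eigvalsh(C[np.ix_(S, S)])  # ascending eigenvalues
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return float(lo), float(hi)

# Hypothetical correlation matrix of six variables.
rng = np.random.default_rng(2)
C = np.corrcoef(rng.standard_normal((50, 6)), rowvar=False)
full = np.linalg.eigvalsh(C)

table = {k: sparse_eigenvalues(C, k) for k in range(1, 7)}
inv_cond = {k: lo / hi for k, (lo, hi) in table.items()}  # assumed definition
for k in table:
    print(k, table[k], inv_cond[k])
```

The enumeration visits all n-choose-k subsets, which is the reason the experiments are restricted to small n and k.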

The aim of our experiments is twofold: First, we wish to evaluate which among the submodular and
spectral parameters are good predictors of the performance of greedy algorithms in practice. Second, we
wish to highlight how the theoretical bounds for subset selection algorithms reflect on their actual
performance. Our analytical results predict that Forward Regression should outperform OMP, which in turn
outperforms Oblivious. For Lasso, it is not known whether strong multiplicative bounds, like the ones we
proved for Forward Regression or OMP, can be obtained.

5.1 Data Sets 

Because several of the spectral parameters (as well as the optimum solution) are NP-hard to compute, we
restrict our experiments to data sets with n < 30 features, from which k ≤ 8 are to be selected. We stress
that the greedy algorithms themselves are very efficient, and the restriction on data set sizes is only intended
to allow for an adequate evaluation of the results.

Each data set contains m > n samples, from which we compute the empirical covariance matrix (analogous
to the Gram matrix in sparse approximation) between all observation variables and the predictor
variable; we then normalize it to obtain C and b. We evaluate the performance of all algorithms in terms of
their $R^2$ fit; thus, we implicitly treat C and b as the ground truth, and also do not separate the data sets into
training and test cases.

Our data sets are the Boston Housing Data, a data set of World Bank Development Indicators, and a
synthetic data set generated from a distribution similar to the one used by Zhang [17]. The Boston Housing
Data (available from the UCI Machine Learning Repository) is a small data set frequently used to evaluate
ML algorithms. It comprises n = 15 features (such as crime rate, property tax rates, etc.) and m = 516
observations. Our goal is to predict housing prices from these features. The World Bank Data (available
from http://databank.worldbank.org) contains an extensive list of socio-economic and health
indicators of development, for many countries and over several years. We choose a subset of n = 29
indicators for the years 2005 and 2006, such that the values for all of the m = 65 countries are known for
each indicator. (The data set does not contain all indicators for each country.) We choose to predict the
average life expectancy for those countries.

To perform tests in a controlled fashion, we also generate random instances from a known distribution
similar to [17]: There are n = 29 features, and m = 100 data points are generated from a joint Gaussian
distribution with moderately high correlations of 0.6. The target vector is obtained by generating coefficients
uniformly from 0 to 10 along each dimension, and adding noise with variance $\sigma^2 = 0.1$. Notice that the
target vector is not truly sparse. The plots we show are the average values for 20 independent runs of the
experiment.
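One way to generate such an instance is sketched below, assuming NumPy. The text does not specify the correlation structure beyond the value 0.6, so the equicorrelated covariance used here, as well as the function name `synthetic_instance` and its parameters, are assumptions made for illustration.

```python
import numpy as np

def synthetic_instance(n=29, m=100, rho=0.6, noise_var=0.1, seed=0):
    """Draw one instance and return the normalized pair (C, b)."""
    rng = np.random.default_rng(seed)
    # Assumed structure: every pair of features has correlation rho.
    cov = np.full((n, n), rho) + (1.0 - rho) * np.eye(n)
    X = rng.multivariate_normal(np.zeros(n), cov, size=m)
    coef = rng.uniform(0.0, 10.0, size=n)  # dense, so the target is not sparse
    z = X @ coef + rng.normal(0.0, np.sqrt(noise_var), size=m)
    A = np.corrcoef(np.column_stack([z, X]), rowvar=False)
    return A[1:, 1:], A[0, 1:]

C, b = synthetic_instance()
print(C.shape, b.shape, np.linalg.eigvalsh(C)[0])
```

Averaging over 20 runs, as described above, would correspond to calling the function with 20 different seeds.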
5.2 Results

We run the different subset selection algorithms for values of k from 2 through 8, and plot the $R^2$ values
for the selected sets. Figures 1, 3 and 5 show the results for the three data sets. The main insight is that
on all data sets, Forward Regression performs optimally or near-optimally, and OMP is only slightly worse.
Lasso performs somewhat worse on all data sets, and, not surprisingly, the baseline Oblivious algorithm
performs even worse. The order of performance of the greedy algorithms matches the order of the strength of
the theoretical bounds we derived for them.
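The comparison between a greedy algorithm and the exhaustive optimum can be reproduced in miniature. The sketch below assumes NumPy, a hypothetical random instance with n = 8, and the usual description of Forward Regression (add, in each round, the variable that maximizes the resulting $R^2$); it is not the experimental code behind the figures.

```python
import itertools
import numpy as np

def r_squared(C, b, S):
    """R^2 of the best linear predictor of Z from the variables indexed by S."""
    S = list(S)
    return float(b[S] @ np.linalg.solve(C[np.ix_(S, S)], b[S])) if S else 0.0

def forward_regression(C, b, k):
    """Greedy selection: each round adds the variable that maximizes R^2."""
    S = []
    for _ in range(k):
        rest = [j for j in range(len(b)) if j not in S]
        S.append(max(rest, key=lambda j: r_squared(C, b, S + [j])))
    return S

# Hypothetical instance with correlated features.
rng = np.random.default_rng(4)
n = 8
X = rng.standard_normal((150, n)) + 0.8 * rng.standard_normal((150, 1))
z = X @ rng.uniform(0.0, 10.0, size=n) + rng.standard_normal(150)
A = np.corrcoef(np.column_stack([z, X]), rowvar=False)
C, b = A[1:, 1:], A[0, 1:]

results = {}
for k in range(2, 6):
    fr = r_squared(C, b, forward_regression(C, b, k))
    opt = max(r_squared(C, b, S) for S in itertools.combinations(range(n), k))
    results[k] = (fr, opt)
    print(k, round(fr, 4), round(opt, 4))
```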

On the World Bank data (Figure 3), all algorithms perform quite well with just 2-3 features already. The
main reason is that adolescent birth rate is by itself highly predictive of life expectancy, so the first feature
selected by all algorithms already contributes high value.

Figure 1: Boston Housing $R^2$ (plotted against k)
Figure 2: Boston Housing parameters (plotted against k)
Figure 3: World Bank $R^2$ (plotted against k)
Figure 4: World Bank parameters (plotted against k)
Figure 5: Synthetic Data (plotted against k)
Figure 6: Synthetic Data parameters (plotted against k)

Figures 2, 4 and 6 show the different spectral quantities for the data sets, for varying values of k. Both
of the real-world data sets are nearly singular, as evidenced by the small $\lambda_{\min}(C)$ values. In fact,
the near singularities manifest themselves for small values of k already; in particular, since
$\lambda_{\min}(C, 2)$ is already small, we observe that there are pairs of highly correlated observation
variables in the data sets. Thus, the bounds on approximation we would obtain by considering merely
$\lambda_{\min}(C, k)$ or $\lambda_{\min}(C, 2k)$ would be quite weak. Notice, however, that they are still
quite a bit stronger than the inverse condition number $\kappa(C, k)^{-1}$: this bound, which is closely
related to the RIP property frequently at the center of sparse approximation analysis, takes on much smaller
values, and thus would be an even looser bound than the eigenvalues.

On the other hand, the submodularity ratios $\gamma_{S^{FR},k}$ for all the data sets are much larger than the
other spectral quantities (almost 5 times larger, on average, than the corresponding $\lambda_{\min}(C)$
values). Notice that unlike the other quantities, the submodularity ratios are not monotonically decreasing
in k; this is due to the dependency of $\gamma_{S^{FR},k}$ on the set $S^{FR}$, which is different for every k.

The discrepancy between the small values of the eigenvalues and the good performance of all algorithms
shows that bounds based solely on eigenvalues can sometimes be loose. Significantly better bounds are
obtained from the submodularity ratio $\gamma_{S^{FR},k}$, which takes on values above 0.2, and significantly
larger in many cases. While not entirely sufficient to explain the performance of the greedy algorithms, it
shows that the near-singularities of C do not align unfavorably with b, and thus do not provide an
opportunity for strong supermodular behavior that adversely affects greedy algorithms.

The synthetic data set we generated is somewhat further from singular, with $\lambda_{\min}(C) \approx 0.11$.
However, the same patterns persist: the simple eigenvalue based bounds, while somewhat larger for small k,
still do not fully predict the performance of greedy algorithms, whereas the submodularity ratio here is
close to 1 for all values of k. This shows that the near-singularities do not at all provide the possibility
of strongly supermodular benefits of sets of variables. Indeed, the plot of $R^2$ values on the synthetic data
is concave, an indicator of submodular behavior of the function.

The above observations suggest that bounds based on the submodularity ratio are better predictors of the
performance of greedy algorithms, followed by bounds based on the sparse eigenvalues, and finally those
based on the condition number or RIP property.
5.3 Narrowing the gap between theory and practice

Our theoretical bounds, though much stronger than previous results, still do not fully predict the observed
near-optimal performance of Forward Regression and OMP on the real-world datasets. In particular, for
Forward Regression, even though the submodularity ratio is less than 0.4 for most cases, implying a theoretical
guarantee of roughly $1 - e^{-0.4} \approx 33\%$, the algorithm still achieves near-optimal performance. While gaps
between worst-case bounds and practical performance are commonplace in algorithmic analysis, they also
suggest that there is scope for further improving the analysis, by looking at more fine-grained parameters.
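To make the size of this gap concrete, the guarantee $1 - e^{-\gamma}$ can be tabulated for a few values of the submodularity ratio. The short sketch below is illustrative; the function name is hypothetical and the chosen values of $\gamma$ are examples, not measurements.

```python
import math

def forward_regression_guarantee(gamma):
    """Worst-case fraction of OPT implied by a submodularity ratio gamma."""
    return 1.0 - math.exp(-gamma)

for gamma in (0.2, 0.4, 0.8, 1.0):
    print(f"gamma = {gamma:.1f}: guarantee = {forward_regression_guarantee(gamma):.3f}")
```

A ratio of 0.4 yields roughly 0.33, whereas a ratio of 0.8 already yields about 0.55, which is why tighter estimates of the ratio translate directly into stronger guarantees.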

Indeed, a slightly more careful analysis of the proof of Theorem 3.2 and our definition of the
submodularity ratio reveals that we do not really need to calculate the submodularity ratio over all sets S
of size k while analyzing the greedy steps of Forward Regression. We can ignore sets S whose submodularity
ratio is low, but whose marginal contribution to the current $R^2$ is only a small fraction (say, at most
$\epsilon$). This is because the proof of Theorem 3.2 shows that for each iteration i + 1, we only need to
consider the submodularity ratio for the set $S_i = S_k^* \setminus S_i^G$, where $S_i^G$ is the set selected by
the greedy algorithm after i iterations, and $S_k^*$ is the optimal k-subset. Thus, if
$R^2_{Z,S_i \cup S_i^G} \le (1 + \epsilon) \cdot R^2_{Z,S_i^G}$, then the currently selected set must
already be within a factor $1/(1 + \epsilon)$ of optimal.

By carefully pruning such sets (using $\epsilon = 0.2$) while calculating the submodularity ratio, we see that
the resulting values of $\gamma_{S^{FR},k}$ are much higher (more than 0.8), thus significantly reducing the gap
between the theoretical bounds and experimental results. Table 1 shows the values of $\gamma_{S^{FR},k}$
obtained using this method.

The results suggest an interesting direction for future work: namely, to characterize for which sets the
submodular behavior of $R^2$ really matters.



Data Set      k = 2   k = 3   k = 4   k = 5   k = 6   k = 7   k = 8
Boston        0.9     0.91    1.02    1.21    1.36    1.54    1.74
World Bank    0.8     0.81    0.81    0.81    0.94    1.19    1.40

Table 1: Improved estimates for submodularity ratio



6 Discussion and Concluding Remarks 

In this paper, we analyze greedy algorithms using the notion of submodularity ratio, which captures how
close to submodular an objective function (in our case the $R^2$ measure of statistical fit) is. Using
submodular analysis, coupled with spectral techniques, we prove the strongest known approximation guarantees for
commonly used greedy algorithms for subset selection and dictionary selection. Our bounds help explain
why greedy algorithms perform well in practice even in the presence of strongly correlated data, and are
substantiated by experiments on real-world and synthetic datasets. The experiments show that the
submodularity ratio is a much stronger predictor of the performance of greedy algorithms than previously used
spectral parameters. We believe that our techniques for analyzing greedy algorithms using a notion of
"approximate submodularity" are not specific to subset selection and dictionary selection, and could also be
used to analyze other problems in compressed sensing and sparse recovery.

References 

[1] E. J. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measure- 
ments. Communications on Pure and Applied Mathematics, 59:1207-1223, 2005. 






[2] A. Das and D. Kempe. Algorithms for subset selection in linear regression. In ACM Symposium on 
Theory of Computing, 2008. 

[3] G. Diekhoff. Statistics for the Social and Behavioral Sciences. Wm. C. Brown Publishers, 2002. 

[4] D. Donoho. For most large underdetermined systems of linear equations, the minimal l1-norm
near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics,
59:1207-1223, 2005.

[5] A. Gilbert, S. Muthukrishnan, and M. Strauss. Approximation of functions over redundant dictionaries 
using coherence. In Proc. ACM-SIAM Symposium on Discrete Algorithms, 2003. 

[6] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, 2002. 

[7] K. Koh, S. Kim, and S. Boyd. l1_ls: Simple Matlab Solver for l1-regularized Least Squares Problems,
2008. http://www.stanford.edu/~boyd/l1_ls.

[8] A. Krause and V. Cevher. Submodular dictionary selection for sparse representation. In Proc. ICML, 
2010. 

[9] A. C. Lozano, G. Swirszcz, and N. Abe. Grouped orthogonal matching pursuit for variable selection 
and prediction. In Proc. NIPS, 2009. 

[10] A. Miller. Subset Selection in Regression. Chapman and Hall, second edition, 2002. 

[11] B. Natarajan. Sparse approximation solutions to linear systems. SIAM Journal on Computing, 24:227-
234, 1995.

[12] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submod- 
ular set functions. Mathematical Programming, 14:265-294, 1978. 

[13] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society, 
58:267-288, 1996. 

[14] J. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Trans. Information 
Theory, 50:2231-2242, 2004. 

[15] J. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Trans. Infor- 
mation Theory, 51:1030-1051, 2006. 

[16] J. Tropp, A. Gilbert, S. Muthukrishnan, and M. Strauss. Improved sparse approximation over
quasi-incoherent dictionaries. In Proc. IEEE-ICIP, 2003.

[17] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In 
Proc. NIPS, 2008. 

[18] T. Zhang. On the consistency of feature selection using greedy least squares regression. Journal of 
Machine Learning Research, 10:555-568, 2009. 

[19] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 
7:2451-2457, 2006. 

[20] S. Zhou. Thresholding procedures for high dimensional variable selection and statistical estimation. 
In Proc. NIPS, 2009. 






A Estimating $\lambda_{\min}(C, k)$

Several of our approximation guarantees are phrased in terms of $\lambda_{\min}(C, k)$. Finding the exact value
of $\lambda_{\min}(C, k)$ is NP-hard in general; here, we show how to estimate lower and upper bounds. Let
$\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$ be the eigenvalues of C, and $e_1, e_2, \ldots, e_n$ the
corresponding eigenvectors. A first simple bound can be obtained directly from the eigenvalue interlacing
theorem: $\lambda_1 \le \lambda_{\min}(C, k) \le \lambda_{n-k+1}$.

One case in which good lower bounds on $\lambda_{\min}(C, k)$ can possibly be obtained is when only a small
(constant) number of the $\lambda_i$ are small. The following lemma allows a bound in terms of any
$\lambda_j$; however, since the running time of the implied algorithm is exponential in j, and the quality of
the bound depends on $\lambda_j$, it is useful only in the special case when $\lambda_j \gg 0$ for a small
constant j.

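Lemma A.1 Let $V_j$ be the vector space spanned by the eigenvectors $e_1, e_2, \ldots, e_j$, and define

\[
\beta_j = \max_{y \in V_j, \, x \in \mathbb{R}^n, \, \|y\|_2 = \|x\|_2 = 1, \, \|x\|_0 \le k} |x \cdot y|.
\]

Then, $\lambda_{\min}(C, k) \ge \lambda_{j+1} \cdot (1 - \beta_j)$.

Proof. Let $x' \in \mathbb{R}^n$, $\|x'\|_2 = 1$, $\|x'\|_0 \le k$ be an eigenvector corresponding to
$\lambda_{\min}(C, k)$. Let $\alpha_i$ be the coefficients of the representation of x' in terms of the $e_i$:
$x' = \sum_{i=1}^{n} \alpha_i e_i$. Thus, $\sum_{i=1}^{n} \alpha_i^2 = 1$, and we can write

\[
\lambda_{\min}(C, k) = x'^T C x' = \sum_{i=1}^{n} \alpha_i^2 \lambda_i
\ge \lambda_{j+1} (1 - \sum_{i=1}^{j} \alpha_i^2).
\]

Since $\sqrt{\sum_{i=1}^{j} \alpha_i^2}$ is the length of the projection of x' onto $V_j$, we have

\[
\sqrt{\sum_{i=1}^{j} \alpha_i^2} = \max_{y \in V_j, \, \|y\|_2 = 1} |x' \cdot y|
\le \max_{y \in V_j, \, x \in \mathbb{R}^n, \, \|x\|_2 = \|y\|_2 = 1, \, \|x\|_0 \le k} |x \cdot y| = \beta_j,
\]

and because $\beta_j \le 1$, this gives $\sum_{i=1}^{j} \alpha_i^2 \le \beta_j^2 \le \beta_j$,
completing the proof. ■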

Since all the $\lambda_j$ can be computed easily, the crux in using this bound is finding a good bound on
$\beta_j$. Next, we show a PTAS (Polynomial-Time Approximation Scheme) for approximating $\beta_j$, for any
constant j.

Lemma A.2 For every $\epsilon > 0$, there is a $1 - \epsilon$ approximation for calculating $\beta_j$,
running in time $O((n/\epsilon)^j)$.

Proof. Any vector $y \in V_j$ with $\|y\|_2 = 1$ can be written as $y = \sum_{i=1}^{j} \eta_i e_i$ with
$\eta_i \in [-1, 1]$ for all i. The idea of our algorithm is to exhaustively search over all y, as
parametrized by their $\eta_i$ entries. To make the search finite, the entries are discretized to multiples of
$\delta = \epsilon \cdot \sqrt{k/(nj)}$. The total number of such vectors to search over is
$(2/\delta)^j \le (n/\epsilon)^j$.

Let x, y attain the maximum in the definition of $\beta_j$, and write $y = \sum_{i=1}^{j} \eta_i e_i$. For each
i, let $\hat{\eta}_i$ be $\eta_i$, rounded to the nearest multiple of $\delta$, and
$\hat{y} = \sum_{i=1}^{j} \hat{\eta}_i e_i$. Then,
$\|y - \hat{y}\|_2 \le \|\delta \sum_{i=1}^{j} e_i\|_2 = \delta \sqrt{j}$.

The vector $x' = \mathrm{argmax}_{x \in \mathbb{R}^n, \, \|x\|_2 = 1, \, \|x\|_0 \le k} |x \cdot \hat{y}|$ is of
the following form: Let I be the set of k indices i such that $|\hat{y}_i|$ is largest, and
$\gamma = \sqrt{\sum_{i \in I} \hat{y}_i^2}$. Then, $x'_i = 0$ for $i \notin I$ and
$x'_i = \hat{y}_i / \gamma$ for $i \in I$. Notice that given $\hat{y}$, we can easily find x', and because
$|x \cdot \hat{y}| \le |x' \cdot \hat{y}|$, we have

\[
\frac{|x \cdot y| - |x' \cdot \hat{y}|}{|x \cdot y|}
\le \frac{|x \cdot y| - |x \cdot \hat{y}|}{|x \cdot y|}
\le \frac{|x \cdot (y - \hat{y})|}{|x \cdot y|}
\le \frac{\|y - \hat{y}\|_2}{|x \cdot y|}
\le \delta \sqrt{jn/k}.
\]

The last inequality follows since the sum of the k largest entries of y is at least $k/\sqrt{n}$, so by
setting $x_i = 1/\sqrt{k}$ for each of those coordinates, we can attain at least an inner product of
$\sqrt{k/n}$, and the inner product with x cannot be smaller.

The value output by the exhaustive search over all discretized values is at least $|x' \cdot \hat{y}|$, and
thus within a factor of $1 - \delta \sqrt{jn/k} = 1 - \epsilon$ of the maximum value, attained by x, y. ■
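The inner step of the search, finding the best k-sparse unit vector x' for a fixed vector, follows the closed form described in the proof: keep the k entries of largest magnitude and normalize. The following pure-Python sketch illustrates that step only (not the full enumeration over discretized coefficients); the function name and the example vector are hypothetical.

```python
import math

def best_sparse_inner_product(y, k):
    """Return the k-sparse unit vector x maximizing |x . y| and the attained value."""
    top = sorted(range(len(y)), key=lambda i: abs(y[i]), reverse=True)[:k]
    gamma = math.sqrt(sum(y[i] ** 2 for i in top))  # norm of the k largest entries
    x = [0.0] * len(y)
    for i in top:
        x[i] = y[i] / gamma
    return x, gamma

x, value = best_sparse_inner_product([3.0, 0.0, -4.0, 1.0], 2)
print(x, value)  # [0.6, 0.0, -0.8, 0.0] 5.0
```

The attained value equals the Euclidean norm of the k largest entries, so evaluating each candidate in the exhaustive search takes time dominated by a sort.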